Prediction of genomic properties and classification of life by protein length distributions

Much evolutionary information is stored in the fluctuations of protein length distributions. The
genome size and non-coding DNA content can be calculated based only on the protein length
distributions, so there is an intrinsic relationship between the coding DNA size and the non-coding
DNA size. According to the correlations and quasi-periodicity of protein length distributions, we
can classify life into three domains. Strong evidence is found to support the existence of order in
the structures of protein length distributions.

PACS numbers: 87.10.+e 



INTRODUCTION 

There is an analogy between the current status of particle
physics and that of molecular biology. Each has a theoretical
framework that explains the particle world or the living
world (the Standard Model for the former, the Central Dogma
and gene regulation for the latter), but we do not know the
dynamical mechanisms that underlie these frameworks. It is
generally believed that the genetic information is stored in
DNA sequences. This often results in the misapprehension
that all the information about a species is stored in the
sequences of its genome. Actually, only a part of the
information of life is stored in the sequences, from which
we may infer the structures and functions of macromolecules.
There is other information, concerning the underlying
mechanism of the evolution of life, which cannot be acquired
by analyzing the sequences. For example, the mechanism that
determines the protein length distribution differs from the
mechanism that determines the protein sequences, and it
requires a new explanation beyond the current theoretical
framework.

The protein length distribution is an intrinsic property of
a species that concerns the underlying mechanism of molecular
evolution, but its nature is still unknown. Some have
considered it the result of a stochastic process, while
others have noticed the order in the length distributions
[l[ 0] 0] [H. We found that much information about the
evolution of life is stored in the fluctuations of protein
length distributions. The genome size of a species, and even
its coding DNA size and non-coding DNA size, can be
calculated from the protein length distribution of that
species. We also found that the three domains of life
(Bacteria, Archaea, and Eucarya) can be classified based on
the correlations or quasi-periodicity of protein length
distributions; furthermore, we can obtain the phylogeny of
the three domains [j|. These results show that the protein
length distributions are not random: the fluctuations in the
distribution are intrinsic properties and can be taken as
the fingerprint of a species. We propose a linguistic model
to explain the generation of protein sequences and,
consequently, the nature of protein length distributions.
The model indicates that the complexity of the grammars that
generate protein sequences is related to the biological
complexity of the species. The protein length distributions
might be taken as clues to discover the underlying mechanism
of molecular evolution.



PREDICTION OF GENOME SIZE AND 
NON-CODING DNA CONTENT BY PROTEIN 
LENGTH DISTRIBUTIONS. 

The evolution of genome size is one of the central problems
in the study of molecular evolution [f| [3]. We can write
down empirical formulae for the genome size and for the
ratio of non-coding DNA to coding DNA based on the protein
length distributions, which may help us understand the
mechanism of genome size evolution.

The protein length distribution of a species $\alpha$ can be
denoted by a vector

$$\mathbf{x}(\alpha) = (x_1(\alpha), x_2(\alpha), \ldots, x_k(\alpha), \ldots), \qquad (1)$$

where there are $x_k(\alpha)$ proteins in its entire proteome
whose lengths are $k$ amino acids (a.a.). Our data of the
protein length distributions are obtained from the data of
$n = 106$ complete proteomes ($n_b = 85$ bacteria,
$n_a = 12$ archaea, $n_e = 7$ eukaryotes and $n_v = 2$
viruses) in the database Predictions for Entire Proteomes
(PEP) [H. The total protein length distribution of all the
species in PEP is
$\mathbf{X} = \sum_{\alpha \in \mathrm{PEP}} \mathbf{x}(\alpha)$,
from which we found that there are few proteins longer than
$m = 3000$ a.a. in the complete proteomes. We can neglect
them and set $m$ as the cutoff of protein length in our
calculations. Hence the vectors $\mathbf{x}(\alpha)$ can be
reduced to $m$-dimensional vectors to represent the protein
length distributions.
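The construction of the length-distribution vector can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `length_distribution` and the toy list of lengths are hypothetical, and real input would be the protein lengths of one complete proteome.

```python
import numpy as np

M_CUTOFF = 3000  # cutoff m on protein length (a.a.), as chosen in the text


def length_distribution(lengths, m=M_CUTOFF):
    """Return x with x[k-1] = number of proteins of length k, for k = 1..m."""
    lengths = np.asarray(lengths, dtype=int)
    # Proteins longer than the cutoff m are neglected
    kept = lengths[(lengths >= 1) & (lengths <= m)]
    # bincount slot 0 would count length 0; drop it so index k-1 <-> length k
    return np.bincount(kept, minlength=m + 1)[1:]


# Hypothetical toy proteome (lengths in a.a.), not real PEP data
toy_lengths = [120, 120, 121, 350, 350, 350, 3500]
x = length_distribution(toy_lengths)
print(x.shape, x[119], x[349], x.sum())
```

The protein of length 3500 exceeds the cutoff and is dropped, so the vector has exactly $m$ components.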

A peak at protein length $l$ in the fluctuations of the
protein length distribution of species $\alpha$ is a
component $x_l(\alpha)$ that is greater than both
$x_{l-1}(\alpha)$ and $x_{l+1}(\alpha)$. The number of peaks
in the fluctuations of the protein length distribution of
species $\alpha$ is denoted by $p(\alpha)$. We found that
there is an exponential relationship between the genome size
$s(\alpha)$ and the number of peaks $p(\alpha)$ (Fig. 1a):

$$s(\alpha) = s' \exp\left(\frac{p(\alpha)}{p_0}\right), \qquad (2)$$

where $s' = 8.36 \times 10^4$ base pairs (bp) and
$p_0 = 70.6$.

FIG. 1: Prediction of genome size and non-coding DNA content
by the protein length distributions. a, The relationship
between the logarithm of genome size $\log s(\alpha)$ and the
number of peaks $p(\alpha)$ in the protein length
distributions. b, The predictions of non-coding DNA contents
by Eq. (4) agree with the non-coding DNA contents of species.

We had also obtained another formula to calculate the
genome size [§]:

$$s(\alpha) = 7.96 \times 10^6 \exp\left(\frac{\eta(\alpha)}{0.165} - \frac{\theta(\alpha)}{0.176}\right), \qquad (3)$$



where $\eta(\alpha)$ is the non-coding DNA content in the
complete genome of species $\alpha$ and the correlation polar
angle is defined by
$\theta(\alpha) = \ldots\,\arccos(\ldots)$. Equating the two
empirical formulae of genome size, Eqs. (2) and (3), and
solving for $\eta(\alpha)$, we obtained a formula to
calculate the non-coding DNA content:

$$\eta(\alpha) = 0.165\left[\ln\frac{8.36 \times 10^4}{7.96 \times 10^6} + \frac{p(\alpha)}{70.6} + \frac{\theta(\alpha)}{0.176}\right] = -0.752 + 0.938\,\theta(\alpha) + 0.00234\,p(\alpha), \qquad (4)$$

where both $\theta$ and $p$ depend only on protein length
distributions. There are 59 species in PEP whose non-coding
DNA contents are known according to Ref. [10]. The
prediction of the non-coding DNA contents agrees with the
experimental observations (Fig. 1b).
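The peak count and the two empirical formulae can be sketched as follows. This is a minimal sketch, not the authors' code. It assumes that Eq. (3) has the form $s = 7.96 \times 10^6 \exp(\eta/0.165 - \theta/0.176)$, that end points of the distribution are not counted as peaks, and that the correlation polar angle `theta` is supplied from elsewhere. The function names are hypothetical.

```python
import math
import numpy as np

S_PRIME = 8.36e4   # s' in Eq. (2), base pairs
P0 = 70.6          # p_0 in Eq. (2)
S3 = 7.96e6        # prefactor of Eq. (3), base pairs
D_ETA = 0.165      # scale of eta in Eq. (3)
D_THETA = 0.176    # scale of theta in Eq. (3)


def count_peaks(x):
    """Count interior indices l with x[l] > x[l-1] and x[l] > x[l+1]."""
    x = np.asarray(x)
    mid = x[1:-1]
    return int(np.sum((mid > x[:-2]) & (mid > x[2:])))


def genome_size(p):
    """Eq. (2): s = s' * exp(p / p_0), in base pairs."""
    return S_PRIME * math.exp(p / P0)


def noncoding_content(theta, p):
    """eta obtained by equating Eqs. (2) and (3) and solving for eta."""
    return D_ETA * (math.log(S_PRIME / S3) + p / P0 + theta / D_THETA)


# Hypothetical toy distribution with strict peaks at the second and seventh entries
toy = [0, 2, 1, 3, 3, 1, 4, 0]
print(count_peaks(toy), genome_size(count_peaks(toy)))
```

Plateaus such as the repeated value 3 in the toy vector are not counted, because the definition requires a strict inequality on both sides.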



Thus, we found that genomic properties can be predicted
based only on the protein length distributions. The
information on the genome size $s(\alpha)$, the non-coding
DNA size $\eta(\alpha)s(\alpha)$ and the coding DNA size
$(1 - \eta(\alpha))s(\alpha)$ is stored in the fluctuations
of the protein length distribution of species $\alpha$. More
peaks in the fluctuations indicate a larger genome size, and
the number of peaks also relates to the non-coding DNA
content. The protein length distribution concerns only the
coding DNA, yet we can deduce the non-coding DNA size from
such a distribution. This shows that there must be a
universal mechanism that adjusts the ratio of coding DNA to
non-coding DNA in each species.

In some studies, the protein length is considered as a
random variable of a stochastic process. If so, there should
not be a close relationship between the number of peaks
$p(\alpha)$ and the genome size $s(\alpha)$, and the outline
of the protein length distribution would become smoother as
the genome size increases. Our results show that the protein
length is not a random variable of a stochastic process and
that the fluctuations are intrinsic properties.

The number of peaks of the fluctuations in the protein
length distribution, $p(\alpha)$, is an intrinsic property
of a species $\alpha$. We also found that the correlation
between $p(\alpha)$ and $s(\alpha)$ is closer than the
correlation between $p(\alpha)$ and the average protein
length $\bar{l}(\alpha)$ (Fig. 1b and Fig. 2c). It has been
suggested that the biological complexity is related to the
non-coding DNA content, and that it is also related to the
genome size for prokaryotes [11] [12] [13]. The number of
peaks $p(\alpha)$ can be interpreted as the complexity of
the structure of the protein length distribution.
Linguistics plays a significant role in the organization of
protein or DNA sequences [14] [15]. We proposed a linguistic
mechanism to account for the generation of protein sequences
[16]. The bell-shaped outline and the fluctuations of the
protein length distribution can be simulated by a linguistic
model. The number of peaks of the fluctuations in the
protein length distribution is determined by the complexity
of grammars. The correlation between $p(\alpha)$ and
$s(\alpha)$ indicates a relationship between the complexity
of the structure of the protein length distribution and the
biological complexity of that species, both of which may
result from the complexity of grammars in the sequences.



CLASSIFICATION OF LIFE BY CORRELATION 
AND QUASI-PERIODICITY OF PROTEIN 
LENGTH DISTRIBUTIONS. 

Molecular sequence analysis provides a more precise and
profound method for the classification of life than
classical taxonomy. Based upon rRNA sequence comparisons,
life on this planet can be divided into three domains: the
Bacteria, the Archaea, and the Eucarya. The differences that
separate the three domains are of a more profound nature
than the differences that separate the classical five
kingdoms (Monera, Protista, Fungi, Plantae, Animalia). The
correlations and quasi-periodicity in the protein length
distributions may originate in a general underlying
mechanism of generation of protein sequences [ri| fig.
According to the theory of multivariate data analysis, we
can classify life into the three domains based only on the
protein length distributions.

FIG. 2: Clustering analysis of life based on the correlation
and quasi-periodicity of protein length distributions. The
species in the three domains cluster together in different
areas in the $R$-$\log D$, $\bar{l}$-$\log D$ and
$\bar{l}$-$p$ planes respectively. a, Clustering analysis by
average correlation $R(\alpha)$ and average Minkowski
distance $D(\alpha)$. b, Clustering analysis by average
protein length $\bar{l}(\alpha)$ and average Minkowski
distance $D(\alpha)$. c, The relationship between average
protein length $\bar{l}(\alpha)$ and the number of peaks
$p(\alpha)$. d, The relationship between the long period
$m/f_h(\alpha)$ and the short period $m/f_{max}(\alpha)$ in
the protein length distributions. We can observe parallel
structures in the distributions of species.

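The correlation coefficient of protein length distributions
between species $\alpha$ and $\beta$ is defined by

$$r(\alpha,\beta) = \frac{\sum_{k=1}^{m} (x_k(\alpha) - \bar{x}(\alpha))(x_k(\beta) - \bar{x}(\beta))}{\sqrt{\sum_{k=1}^{m} (x_k(\alpha) - \bar{x}(\alpha))^2}\,\sqrt{\sum_{k=1}^{m} (x_k(\beta) - \bar{x}(\beta))^2}}, \qquad (5)$$

where $\bar{x}(\alpha) = \frac{1}{m}\sum_{k=1}^{m} x_k(\alpha)$.
The average correlation coefficient of species $\alpha$ can
be defined by

$$R(\alpha) = \frac{1}{106} \sum_{\beta \in \mathrm{PEP}} r(\alpha,\beta). \qquad (6)$$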



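We can also define the Minkowski distance between species
$\alpha$ and $\beta$ as

$$d(\alpha,\beta) = \left(\sum_{k=1}^{m} \left|\frac{x_k(\alpha)}{\|\mathbf{x}(\alpha)\|} - \frac{x_k(\beta)}{\|\mathbf{x}(\beta)\|}\right|^{q}\right)^{1/q}, \qquad (7)$$

where we chose the parameter $q = 1/4$ in the calculation.
The average Minkowski distance of species $\alpha$ can be
defined by

$$D(\alpha) = \frac{1}{106} \sum_{\beta \in \mathrm{PEP}} d(\alpha,\beta). \qquad (8)$$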



The average correlation coefficient $R(\alpha)$ and the
average Minkowski distance $D(\alpha)$ represent the
evolutionary relationship between species $\alpha$ and all
the other species. The larger the average correlation
coefficient (or, conversely, the smaller the average
Minkowski distance), the closer the evolutionary
relationship.
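The pairwise and averaged quantities above can be sketched in Python as follows. This is an illustration only. The function names are hypothetical, the norm $\|\mathbf{x}\|$ is assumed to be the Euclidean norm (the text does not specify it), and the averages run over all rows of the input matrix, including the species itself, in the same way that the sums run over all of PEP.

```python
import numpy as np


def correlation(xa, xb):
    """Pearson correlation coefficient between two length distributions."""
    da = xa - xa.mean()
    db = xb - xb.mean()
    return float(np.sum(da * db) / np.sqrt(np.sum(da**2) * np.sum(db**2)))


def minkowski(xa, xb, q=0.25):
    """Minkowski distance with exponent q between norm-scaled distributions."""
    ua = xa / np.linalg.norm(xa)   # assumption: ||x|| is the Euclidean norm
    ub = xb / np.linalg.norm(xb)
    return float(np.sum(np.abs(ua - ub) ** q) ** (1.0 / q))


def averages(X):
    """Average correlation R and average distance D for each row of X."""
    n = X.shape[0]
    r = np.array([[correlation(X[i], X[j]) for j in range(n)] for i in range(n)])
    d = np.array([[minkowski(X[i], X[j]) for j in range(n)] for i in range(n)])
    return r.mean(axis=1), d.mean(axis=1)


# Hypothetical toy data: three "species" with four length bins each
X = np.array([[1.0, 2.0, 3.0, 4.0],
              [2.0, 3.0, 5.0, 4.0],
              [4.0, 3.0, 2.0, 1.0]])
R, D = averages(X)
print(R, np.log(D))
```

Plotting `R` against `np.log(D)` for real proteomes would give the kind of plane used for the clustering analysis. With $q < 1$ the expression is not a metric in the strict sense, which does not matter for its use as a dissimilarity score here.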

We can classify life by clustering analysis based only on
the protein length distributions. In the $R$-$\log D$ plane,
we found that the species in the three domains cluster
together respectively (Fig. 2a). Similar results can also be
obtained by other methods. The average protein length in the
proteome of species $\alpha$ can be calculated from the
protein length distribution:

$$\bar{l}(\alpha) = \frac{\sum_{k=1}^{m} k\,x_k(\alpha)}{\sum_{k=1}^{m} x_k(\alpha)}. \qquad (9)$$

According to the distributions of species in the plots of
$\bar{l}$-$R$, $\bar{l}$-$\log D$ and $\bar{l}$-$p$, we
found that the species in the three domains also cluster
together in the corresponding plots respectively.
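As a small worked example, the average protein length can be computed from a length-distribution vector as the mean of the distribution. The sketch assumes this reading of Eq. (9), namely that the average is the count-weighted mean of the lengths $k = 1, \ldots, m$. The function name is hypothetical.

```python
import numpy as np


def mean_length(x):
    """Average protein length: sum_k k * x_k / sum_k x_k, with k = 1..m."""
    x = np.asarray(x, dtype=float)
    k = np.arange(1, x.size + 1)   # protein lengths 1..m
    return float(np.sum(k * x) / np.sum(x))


# Hypothetical toy distribution: two proteins of length 100, one of length 400
x = np.zeros(3000)
x[99] = 2
x[399] = 1
print(mean_length(x))
```

Here the result is $(2 \cdot 100 + 400)/3 = 200$ a.a.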

We can also classify life according to the quasi-periodicity
of protein length distributions. There are underlying orders
in the organization of protein sequences, so we can observe
quasi-periodic structures in the protein length
distributions. According to Fourier analysis, we can also
observe the clustering of species for different domains.

FIG. 3: Differences in the average Fourier spectra of the
protein length distributions for three domains. The ranges
for averaging are chosen as $s_1 = 100$ (blue lines) and
$s_2 = 300$ (red lines) for the three domains respectively.
a, The total average Fourier spectra $A(s_1)$ and $A(s_2)$
for Archaea. b, The total average Fourier spectra $E(s_1)$
and $E(s_2)$ for Eucarya. c, The total average Fourier
spectra $B(s_1)$ and $B(s_2)$ for Bacteria.

The absolute value of the discrete Fourier transformation of
the protein length distribution $\mathbf{x}(\alpha)$ is:

$$y_f(\alpha) = \left|\frac{1}{\sqrt{m}} \sum_{k=1}^{m} x_k(\alpha)\, e^{2\pi i (f-1)(k-1)/m}\right|. \qquad (10)$$

The frequency of the highest peak in the fluctuations of the
spectrum $\mathbf{y}(\alpha)$ can be denoted by
$f_h(\alpha)$. The maximum frequency among the 10 highest
peaks in the fluctuations of the spectrum
$\mathbf{y}(\alpha)$ can be denoted by $f_{max}(\alpha)$.

We found that there is an interesting relationship between
the highest-peak frequency $f_h(\alpha)$ and the average
protein length $\bar{l}$ of species (Fig. 3 in Ref. [16]).
The distribution of species in the $f_h$-$\bar{l}$ plane
shows a regular pattern: the species in the three domains
gather in three rainbow-like arches respectively. This
pattern strongly indicates the intrinsic correlation among
the protein length distributions, which could not arise if
the protein length distributions were stochastic. The
periodicity of the protein length distributions can be
inferred from the correlations between $f_h$ and $f_{max}$.
The regular distribution of species in Fig. 2d indicates the
correlation between the long period $m/f_h$ and the short
period $m/f_{max}$. Therefore, the quasi-periodic structures
of the spectra indicate the correlations between long
proteins and short proteins in a proteome.

The total spectra for the three domains Bacteria, Archaea
and Eucarya are
$\mathbf{b} = \frac{1}{n_b}\sum_{\alpha \in \mathrm{Bacteria}} \mathbf{y}(\alpha)$,
$\mathbf{a} = \frac{1}{n_a}\sum_{\alpha \in \mathrm{Archaea}} \mathbf{y}(\alpha)$ and
$\mathbf{e} = \frac{1}{n_e}\sum_{\alpha \in \mathrm{Eucarya}} \mathbf{y}(\alpha)$
respectively. So the average spectra for the three domains
are defined as follows respectively:

$$B_f(s) = \frac{1}{2s+1} \sum_{k=f-s}^{f+s} b_k, \qquad (11)$$

$$A_f(s) = \frac{1}{2s+1} \sum_{k=f-s}^{f+s} a_k, \qquad (12)$$

$$E_f(s) = \frac{1}{2s+1} \sum_{k=f-s}^{f+s} e_k, \qquad (13)$$

where $f = 1+s, \ldots, m-s$, and we chose $s_1 = 100$ and
$s_2 = 300$ as the ranges for averaging in calculations. We
found that the outlines of $A$ and $E$ are similar, both of
which have convex bottoms (Fig. 3a, b), but they are
dissimilar to the outline of $B$, which has a concave bottom
(Fig. 3c). According to the phylogeny of the three domains,
the relationship between Archaea and Eucarya is closer than
the relationship between Archaea and Bacteria. So the
outlines of the average spectra of the three domains reflect
the phylogeny of the three domains.
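As a rough illustration of the Fourier quantities used in this section, the sketch below computes the modulus of the discrete Fourier transform of a length distribution, reads off $f_h$ and $f_{max}$ from its local maxima, and applies a window average over $k = f-s, \ldots, f+s$. It is not the authors' code. The function names are hypothetical, restricting the peak search to the first half of the spectrum (the second half mirrors it for real input) is an assumption, and NumPy's FFT sign convention is used, which leaves the modulus unchanged.

```python
import numpy as np


def spectrum(x):
    """Modulus of the DFT with 1/sqrt(m) normalisation; entry f-1 holds y_f."""
    x = np.asarray(x, dtype=float)
    # np.fft.fft uses exp(-2*pi*i*...); for real x the modulus is the same
    return np.abs(np.fft.fft(x)) / np.sqrt(x.size)


def peak_frequencies(y, top=10):
    """Return (f_h, f_max) as 1-based frequency indices of local maxima."""
    half = y[: y.size // 2]                  # assumption: ignore the mirrored half
    idx = np.arange(1, half.size - 1)        # interior positions only
    is_peak = (half[1:-1] > half[:-2]) & (half[1:-1] > half[2:])
    peaks = idx[is_peak]
    # Sort the local maxima by height, highest first
    order = peaks[np.argsort(half[peaks])[::-1]]
    f_h = int(order[0]) + 1                  # frequency of the highest peak
    f_max = int(order[:top].max()) + 1       # largest frequency among the top peaks
    return f_h, f_max


def window_average(v, s):
    """Mean of v over k = f-s..f+s for every f = 1+s..m-s."""
    kernel = np.ones(2 * s + 1) / (2 * s + 1)
    return np.convolve(v, kernel, mode="valid")


# Hypothetical toy signal with period 50 over m = 1000 bins
k = np.arange(1000)
x = 10 + 5 * np.cos(2 * np.pi * k / 50)
y = spectrum(x)
print(peak_frequencies(y), window_average(y, 100).shape)
```

A domain spectrum would then be the mean of `spectrum(x)` over the species of that domain, smoothed with `window_average` for $s = 100$ and $s = 300$. The sketch raises an error if the spectrum has no interior local maximum, a case a fuller implementation would need to handle.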



CONCLUSION AND DISCUSSION 

We can conclude that much evolutionary information is stored
in the protein length distributions, from which we can
calculate the genome size and the non-coding DNA content.
The mechanism that determines the protein length
distribution concerns not only coding DNA but also
non-coding DNA. We have thus demonstrated the intrinsic
relationship between coding DNA size and non-coding DNA
size. New methods are proposed to classify life into three
domains based only on the protein length distributions. We
confirm that there are quasi-periodic structures in the
protein length distributions, and we found a relationship
between the average protein length and the frequencies of
the Fourier transformation of protein length distributions.

The protein length distributions cannot be considered as
random fluctuations. We found strong evidence of an
underlying mechanism in the organization of amino acids in
protein sequences, which indicates that there are languages
in the protein sequences. We proposed a linguistic model to
account for the protein length distribution, which suggests
that there is a close relationship between the complexity of
the grammars of the protein sequences and the biological
complexity of the species.